Machine Learning Final Presentation¶

Predicting Outcomes of US Supreme Court Oral Arguments¶

Team Members: Federico Dominguez, Chanteria Milner, Jessup Jong, and Michael Plunkett

Background¶

  • Dataset: ConvoKit's Supreme Court Oral Arguments Corpus
  • Source: Court transcripts from oyez.org; voting information from the Supreme Court Database
  • Goal: Predict case decision using case transcripts and NLP models

Applications¶

  • This project could be useful for academics, judges, litigants, policymakers, and the public.
  • The legal theory behind a machine learning project that predicts case rulings can be traced back to empirical legal studies, legal realism, and behavioralism.
  • Predictions of case rulings are vulnerable to criticism from legal positivism, critical legal studies, and law and economics.

Datasets¶

Six total datasets¶

  1. Cases
  2. Speakers
  3. Voters
  4. Advocates
  5. Conversations
  6. Utterances

Case Information¶

Includes information on each court case, such as:

  • Unique case ID
  • Year and title of case
  • Case petitioner and respondent
  • Winning side (1 = for petitioner) and decision date
In [203]:
ds.cases_stats
Out[203]:
                          counts  percentages
win side  for petitioner   284.0    66.510539
          for respondent   143.0    33.489461
cases                      427.0          NaN
courts                       1.0          NaN
years (2014 to 2019)         6.0          NaN
petitioners                413.0          NaN
respondents                356.0          NaN

Speakers¶

Includes information on each speaker, such as:

  • Speaker name and unique speaker key
  • Speaker role and type (justice, advocate, nan)
In [204]:
ds.speakers_stats
Out[204]:
                            counts  percentages
speaker type  advocate (A)  8942.0    99.610115
              justice (J)     35.0     0.389885
speaker names               8928.0          NaN
speaker keys                8977.0          NaN

Voters¶

Includes information on each vote and voter, such as:

  • Unique case ID
  • Voter key and vote side (judges only, 1=for petitioner)
In [205]:
ds.voters_stats.head(7)
Out[205]:
                                counts  percentages
votes    for petitioner         1912.0    60.659898
         for respondent         1240.0    39.340102
justices                          11.0          NaN
justice  j__john_g_roberts_jr      363     0.661157
         j__antonin_scalia          66     0.651515
         j__anthony_m_kennedy      240     0.658333
         j__clarence_thomas        364     0.532967

Advocates¶

Includes information on each advocate (non-justice), such as:

  • Unique case ID
  • Advocate ID and advocacy side (1 = for petitioner)
  • Advocate role
In [206]:
ds.advocates_stats
Out[206]:
                                 counts  percentages
side             for petitioner   403.0    50.124378
                 for respondent   401.0    49.875622
total advocates                   391.0          NaN
total roles                       154.0          NaN
aggregate roles  inferred          10.0     1.243781
                 for respondent   404.0    50.248756
                 for petitioner   390.0    48.507463

Conversations¶

Includes information on each conversation. There is one conversation per case,
and conversations are made up of individual utterances. Conversation information includes:

  • Unique case ID
  • Unique conversation ID
  • Winning side (1 = for petitioner)
In [207]:
conversations.head(2)
Out[207]:
      id      case_id  winning_side
0  23291  2014_13-553             1
1  23252  2014_13-895             1

Utterances¶

Includes information on each utterance, such as:

  • Unique case ID
  • Corresponding conversation ID
  • Speaker key
  • Utterance text
In [208]:
cols = ["case_id", "speaker", "speaker_type", "conversation_id", "text"]
utterances.head(2).loc[:, cols]
Out[208]:
       case_id               speaker speaker_type  conversation_id                                               text
0  2014_13-553  j__john_g_roberts_jr            J            23291  we'll hear argument next in case no. 13-553, t...
1  2014_13-553      andrew_l_brasher            A            23291  thank you, mr. chief justice, and may it pleas...

Data Cleaning Steps¶

  1. Limited cases to those that ruled either for the petitioner or respondent (removed undetermined)
  2. Removed cases with no utterances
  3. Cleaned utterance text
  4. Filtered cases to the last six years of the dataset (2014-2019)
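The filtering steps above can be sketched in plain Python. The field names (`win_side`, `utterances`, `year`) are illustrative stand-ins, not the project's actual schema:

```python
# Sketch of the cleaning filters on a list of case records.
def clean_cases(cases):
    cleaned = []
    for case in cases:
        if case["win_side"] not in (0, 1):      # step 1: drop undetermined rulings
            continue
        if not case["utterances"]:              # step 2: drop cases with no utterances
            continue
        if not (2014 <= case["year"] <= 2019):  # step 4: keep only 2014-2019
            continue
        cleaned.append(case)
    return cleaned

cases = [
    {"win_side": 1, "utterances": ["..."], "year": 2015},
    {"win_side": None, "utterances": ["..."], "year": 2016},  # undetermined
    {"win_side": 0, "utterances": [], "year": 2017},          # no utterances
    {"win_side": 0, "utterances": ["..."], "year": 2010},     # outside 2014-2019
]
print(len(clean_cases(cases)))  # 1
```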

Data Processing Steps¶

  1. Tokenized utterance text (spaCy)
    • Kept alphabetic tokens only
    • Removed stop words (e.g., ['a', 'the', 'by'])
    • Lemmatized (running → run)
  2. Created utterance dataframes that include tokenized text, case ID, year, and winning side
    • Engineered features for the average number of sentences and average number of words per utterance
    • Dataframes correspond to:
      • All utterances within a case
      • Judge utterances within a case
      • Advocate (for petitioner) utterances within a case
      • Adversary (for respondent) utterances within a case
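A stdlib-only sketch of the tokenization step follows. The real pipeline uses spaCy's tokenizer, stop-word list, and lemmatizer, so the tiny stop list and lemma table here are placeholders:

```python
import re

# Placeholder stop-word list and lemma table; spaCy's real components
# are far more complete.
STOP_WORDS = {"a", "the", "by", "in", "we", "ll", "no"}
LEMMAS = {"running": "run", "arguments": "argument"}

def tokenize(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # alphabetic tokens only
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [LEMMAS.get(t, t) for t in tokens]            # lemmatize known forms

print(tokenize("We'll hear the arguments in case no. 13-553"))
# ['hear', 'argument', 'case']
```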

Pre-processed Datasets¶

All Utterances¶

In [209]:
cases_proc.head(2)
Out[209]:
       case_id                                             tokens  avg_num_sentences  avg_num_words  year  win_side
0  2014_13-553  [hear, argument, case, alabama, department, re...           2.447368     178.478947  2014         1
1  2014_13-895  [hear, argument, case, number, alabama, legisl...           2.432203     184.368644  2014         1

Judge Utterances¶

In addition to the standard columns, this dataframe includes counts of the advocates for the petitioner and for the respondent.

In [210]:
judges_proc.head(2)
Out[210]:
       case_id                                             tokens  avg_num_sentences  avg_num_words  year  win_side
0  2014_13-553  [hear, argument, case, alabama, department, re...           1.682692        94.5000  2014         1
1  2014_13-895  [hear, argument, case, number, alabama, legisl...           2.039062       132.9375  2014         1

Advocate Utterances¶

In [211]:
advocate_proc.head(2)
Out[211]:
       case_id                                             tokens  avg_num_sentences  avg_num_words  year  win_side
0  2014_13-553  [hear, argument, case, alabama, department, re...           2.212329     163.547945  2014         1
1  2014_13-895  [mr, chief, justice, court, alabama, employ, r...           2.326241     164.439716  2014         1

Adversary Utterances¶

In [212]:
adversary_proc.head(2)
Out[212]:
       case_id                                             tokens  avg_num_sentences  avg_num_words  year  win_side
0  2014_13-553  [handpicked, business, transport, good, motor,...           3.227273     228.022727  2014         1
1  2014_13-895  [hear, argument, case, number, alabama, legisl...           2.589474     213.947368  2014         1

Model and Evaluation Overviews¶

  • Logistic Regression
  • Gradient Boosted Tree Model
  • Random Forest

Logistic Regression¶

  • Logistic regression is a binary classification model that predicts court case outcomes based on a 'bag of words'.

  • It assumes a linear relationship between variables and doesn't consider precedents or social trends.

  • The current model did not use regularization, which is used to help prevent overfitting.

  • Four datasets were used, and models based on advocate and adversary utterances achieved higher accuracies than those using judge utterances or a combination of all utterances.

  • The logistic regression model provided insight into the predictive power of different utterance sets, indicating that advocate and adversary statements are more predictive of case outcomes than judge utterances or all utterances aggregated together.
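How a bag-of-words logistic regression turns tokens into a prediction can be illustrated without any ML library. The vocabulary, weights, and bias below are made up for the example:

```python
from collections import Counter
import math

def bag_of_words(tokens, vocab):
    # Count how often each vocabulary word appears in the utterance tokens.
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def predict_proba(x, weights, bias):
    # Linear score followed by the sigmoid link: P(win_side = 1).
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

vocab = ["argument", "petitioner", "respondent"]
x = bag_of_words(["argument", "argument", "petitioner"], vocab)
print(x)  # [2, 1, 0]
p = predict_proba(x, weights=[0.2, 0.5, -0.4], bias=-0.1)
print(round(p, 3))  # 0.69
```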

Gradient Boosted Tree Model¶

  • XGBoost, short for eXtreme Gradient Boosting, is an ensemble model that uses gradient boosting with decision trees to minimize the loss function.

  • It sequentially grows trees, considering the residuals of the previous tree and reweighting the observations.

  • Unlike Random Forest, XGBoost adjusts the model on every iteration using the previous residuals as the new target variable, allowing it to learn from mistakes and improve.

  • Limitations of XGBoost include difficulty in interpretation due to its use of multiple trees and its predisposition to overfitting if parameters are not tuned properly.

  • XGBoost performs better than single models but requires a finetuning process to determine the best hyperparameters for a specific context.
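The residual-reweighting loop described above can be illustrated with a toy weak learner: a constant mean step rather than a regularized decision tree, which is what XGBoost actually fits. `n_rounds` and `learning_rate` are illustrative:

```python
# Each round fits a weak learner to the residuals of the ensemble so far,
# i.e., the previous residuals become the new target variable.
def boost(y, n_rounds=3, learning_rate=0.5):
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # new target
        step = sum(residuals) / len(residuals)            # weak learner: mean
        pred = [pi + learning_rate * step for pi in pred]
    return pred

y = [1.0, 0.0, 1.0, 1.0]
for k in (1, 2, 3):
    pred = boost(y, n_rounds=k)
    mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
    print(k, round(mse, 4))
# MSE shrinks each round: 0.3281, 0.2227, 0.1963
```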

Random Forest¶

  • Random Forest is an ensemble model with multiple decision trees that combines the bagging and random feature selection methods.

  • Limitations of the Random Forest model include reduced interpretability compared to decision trees and the need for more time and resources for training due to bagging and random feature subsets.

  • The predictions based on the bag of words CountVectorizer solely consider word frequency and may not capture complex linguistic relationships.

  • The Random Forest model was chosen to capture complex interactions in unstructured data, avoid overfitting, and rank word importance.

  • You can assess the model's accuracy by predicting case outcomes and examining its word importance metrics.

  • Cross-validation and testing on out-of-sample data help to reveal how well the model generalizes and avoids overfitting.
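Bagging and majority voting can be sketched with a toy "tree": a threshold rule learned on a bootstrap sample of one feature. A real random forest grows full decision trees over random feature subsets; the data and seed here are made up:

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    # Split at the sample mean and label the upper side by majority class.
    threshold = sum(xs) / len(xs)
    above = [y for x, y in zip(xs, ys) if x > threshold]
    label_above = Counter(above).most_common(1)[0][0] if above else 1
    return lambda x: label_above if x > threshold else 1 - label_above

def fit_forest(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # bootstrap sample
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict(trees, x):
    votes = Counter(t(x) for t in trees)  # majority vote across trees
    return votes.most_common(1)[0][0]

xs = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
ys = [0, 0, 0, 1, 1, 1]
forest = fit_forest(xs, ys)
print(predict(forest, 1.5), predict(forest, 9.5))
```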

Evaluation Metrics: Accuracy and F1 Score¶

  • Initially, we used overall accuracy as a general benchmark.
  • Given that the majority of cases (approximately 67%) were voted in favor of the petitioner (win_side=1), we also evaluated our models using the F1 score to account for this imbalance.
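The imbalance problem can be made concrete with illustrative counts: a model that always predicts "petitioner wins" (win_side = 1) looks decent on accuracy but poorly on per-class F1:

```python
def f1(tp, fp, fn):
    # Harmonic mean of precision and recall, guarding against empty classes.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Always-petitioner guesser on 67 petitioner / 33 respondent cases:
accuracy = 67 / 100
f1_petitioner = f1(tp=67, fp=33, fn=0)  # positive class = petitioner
f1_respondent = f1(tp=0, fp=0, fn=33)   # positive class = respondent
macro_f1 = (f1_petitioner + f1_respondent) / 2
print(round(accuracy, 2), round(f1_petitioner, 2), round(f1_respondent, 2), round(macro_f1, 2))
# 0.67 0.8 0.0 0.4
```

The macro F1 of 0.4 exposes the guesser that a 0.67 accuracy would hide.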
In [214]:
simple_bar_plot(xlabel, ylabel, labs, y)

Logistic Regression¶

Default Parameters¶

  • Maximum number of features: 5000
  • Maximum number of iterations: 1000
  • Test size: 0.20

Accuracies and F1 Score¶

In [216]:
disp_accuracy(lr_acc, labs=list(lr_acc["dataset"]))

Confusion Matrices¶

Logistic Regression Model - All Utterances¶

In [217]:
disp_conf_matrix(lr.confusion_matrix["case_aggregations"])

Logistic Regression Model - Judge Utterances¶

In [218]:
disp_conf_matrix(lr.confusion_matrix["judge_aggregations"])

Logistic Regression Model - Advocate Utterances¶

In [219]:
disp_conf_matrix(lr.confusion_matrix["advocate_aggregations"])

Logistic Regression Model - Adversary Utterances¶

In [220]:
disp_conf_matrix(lr.confusion_matrix["adversary_aggregations"])

Hyperparameter Tuning - Maximum Iterations¶

In [228]:
plot_accuracy_scores(
    max_iter_f1_melted, maxiter, lr_label, accuracy_metric="F1"
)
In [227]:
plot_accuracy_scores(max_iter_melted, maxiter, lr_label)
In [229]:
plot_accuracy_scores(et_df, maxiter, lr_label, accuracy_metric="Execution Time")

Max Features Finetuning¶

In [223]:
plot_accuracy_scores(max_feature_lr_melted, maxfeat, lr_label)
In [224]:
plot_accuracy_scores(
    max_feature_f1_lr_melted, maxfeat, lr_label, accuracy_metric="F1"
)
In [225]:
plot_accuracy_scores(et_df, maxfeat, lr_label, accuracy_metric="Execution Time")

Random Forest¶

Default Parameters¶

  • Maximum depth: None
  • Maximum number of features: 5000
  • Number of trees: 100
  • Test size: 0.20

Accuracies and F1 Score¶

In [231]:
disp_accuracy(rf_acc, labs=list(rf_acc["dataset"]))

Confusion Matrices¶

Random Forest Model - All Utterances¶

In [232]:
disp_conf_matrix(rf.confusion_matrix["case_aggregations"])

Random Forest Model - Judge Utterances¶

In [233]:
disp_conf_matrix(rf.confusion_matrix["judge_aggregations"])

Random Forest Model - Advocate Utterances¶

In [234]:
disp_conf_matrix(rf.confusion_matrix["advocate_aggregations"])

Random Forest Model - Adversary Utterances¶

In [235]:
disp_conf_matrix(rf.confusion_matrix["adversary_aggregations"])

Hyperparameter Tuning - Maximum Tree Depth¶

In [246]:
plot_accuracy_scores(max_depth_rf_melted, maxdepth, rf_label)
In [247]:
plot_accuracy_scores(
    max_depth_f1_rf_melted,
    maxdepth,
    rf_label,
    accuracy_metric="F1",
)
In [248]:
plot_accuracy_scores(
    et_df,
    maxdepth,
    rf_label,
    accuracy_metric="Execution Time",
)

Max Features Finetuning¶

In [238]:
plot_accuracy_scores(max_feature_melted, maxfeat, rf_label)
In [239]:
plot_accuracy_scores(
    max_feature_f1_melted, maxfeat, rf_label, accuracy_metric="F1"
)
In [240]:
plot_accuracy_scores(et_df, maxfeat, rf_label, accuracy_metric="Execution Time")

Number of Trees Finetuning¶

In [242]:
plot_accuracy_scores(num_trees_melted, ntree_lab, rf_label)
In [243]:
plot_accuracy_scores(
    num_trees_f1_melted,
    ntree_lab,
    rf_label,
    accuracy_metric="F1",
)
In [244]:
plot_accuracy_scores(
    et_df,
    ntree_lab,
    rf_label,
    accuracy_metric="Execution Time",
)

Gradient Boosted Tree¶

Default Parameters¶

  • Maximum number of features: 5000
  • Test size: 0.20
  • Maximum depth: 7
  • Number of estimators: 100
  • Learning rate: 0.3
  • Subsamples: 1

Accuracies and F1 Score¶

In [250]:
disp_accuracy(xg_acc, labs=xg_acc["dataset"])

Confusion Matrices¶

Gradient Boosted Tree Model - All Utterances¶

In [251]:
disp_conf_matrix(xg.confusion_matrix["case_aggregations"])

Gradient Boosted Tree Model - Judge Utterances¶

In [252]:
disp_conf_matrix(xg.confusion_matrix["judge_aggregations"])

Gradient Boosted Tree Model - Advocate Utterances¶

In [253]:
disp_conf_matrix(xg.confusion_matrix["advocate_aggregations"])

Gradient Boosted Tree Model - Adversary Utterances¶

In [254]:
disp_conf_matrix(xg.confusion_matrix["adversary_aggregations"])

Hyperparameter Tuning - Learning Rate¶

In [259]:
plot_accuracy_scores(et_df, eta, xg_label, accuracy_metric="Execution Time")
In [257]:
plot_accuracy_scores(eta_melted, eta, xg_label)
In [258]:
plot_accuracy_scores(eta_f1_melted, eta, xg_label, accuracy_metric="F1")

Maximum Tree Depth Finetuning¶

In [261]:
plot_accuracy_scores(max_depth_melted, maxdepth, xg_label)
In [262]:
plot_accuracy_scores(
    max_depth_f1_melted, maxdepth, xg_label, accuracy_metric="F1"
)
In [263]:
plot_accuracy_scores(
    et_df, maxdepth, xg_label, accuracy_metric="Execution Time"
)

Subsample Finetuning¶

In [265]:
# Plot accuracy scores
plot_accuracy_scores(subsample_melted, subsamp, xg_label)
In [266]:
plot_accuracy_scores(
    subsample_f1_melted, subsamp, xg_label, accuracy_metric="F1"
)
In [267]:
plot_accuracy_scores(et_df, subsamp, xg_label, accuracy_metric="Execution Time")

Model Comparisons - Advocate Utterances¶

In [9]:
disp_accuracy(model_comp, labs=model_comp["model"])

Thank you!¶

Any Questions?¶